Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs
Yang, Xikang, Zhou, Biyu, Tang, Xuehai, Han, Jizhong, Hu, Songlin
Large Language Models (LLMs) demonstrate impressive capabilities across a wide range of tasks, yet their safety mechanisms remain susceptible to adversarial attacks that exploit cognitive biases -- systematic deviations from rational judgment. Unlike prior jailbreaking approaches focused on prompt engineering or algorithmic manipulation, this work highlights the overlooked power of multi-bias interactions in undermining LLM safeguards. We propose CognitiveAttack, a novel red-teaming framework that systematically leverages both individual and combined cognitive biases. By integrating supervised fine-tuning and reinforcement learning, CognitiveAttack generates prompts that embed optimized bias combinations, effectively bypassing safety protocols while maintaining high attack success rates. Experimental results reveal significant vulnerabilities across 30 diverse LLMs, particularly in open-source models. CognitiveAttack achieves a substantially higher attack success rate compared to the SOTA black-box method PAP (60.1% vs. 31.6%), exposing critical limitations in current defense mechanisms. These findings highlight multi-bias interactions as a powerful yet underexplored attack vector. This work introduces a novel interdisciplinary perspective by bridging cognitive science and LLM safety, paving the way for more robust and human-aligned AI systems.
- Information Technology > Security & Privacy (1.00)
- Government (1.00)
Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard
Yang, Yudong, Zhang, Xuezhen, Han, Zhifeng, Wang, Siyin, Zhuang, Jimin, Jin, Zengrui, Shao, Jing, Sun, Guangzhi, Zhang, Chao
Recent progress in large language models (LLMs) has enabled understanding of both speech and non-speech audio, but it also exposes new safety risks arising from complex audio inputs that current safeguards handle inadequately. We introduce SACRED-Bench (Speech-Audio Composition for RED-teaming) to evaluate the robustness of LLMs under complex audio-based attacks. Unlike existing perturbation-based methods that rely on noise optimization or white-box access, SACRED-Bench exploits speech-audio composition mechanisms. SACRED-Bench adopts three mechanisms: (a) speech overlap and multi-speaker dialogue, which embed harmful prompts beneath or alongside benign speech; (b) speech-audio mixture, which implies unsafe intent via non-speech audio alongside benign speech or audio; and (c) diverse spoken instruction formats (open-ended QA, yes/no) that evade text-only filters. Experiments show that even Gemini 2.5 Pro, the state-of-the-art proprietary LLM, still exhibits a 66% attack success rate on the SACRED-Bench test set, exposing vulnerabilities under cross-modal, speech-audio composition attacks. To bridge this gap, we propose SALMONN-Guard, a safeguard LLM that jointly inspects speech, audio, and text for safety judgments, reducing the attack success rate to 20%. Our results highlight the need for audio-aware defenses for the safety of multimodal LLMs. The benchmark and SALMONN-Guard checkpoints can be found at https://huggingface.co/datasets/tsinghua-ee/SACRED-Bench. Warning: this paper includes examples that may be offensive or harmful.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
- Information Technology > Security & Privacy (0.46)
- Government > Military (0.46)
Chain-of-Thought Hijacking
Zhao, Jianli, Fu, Tingchen, Schaeffer, Rylan, Sharma, Mrinank, Barez, Fazl
Large reasoning models (LRMs) achieve higher task performance with more inference-time computation, and prior work suggests this scaled reasoning may also strengthen safety by improving refusal. Yet we find the opposite: the same reasoning can be used to bypass safeguards. We introduce Chain-of-Thought Hijacking, a jailbreak attack on reasoning models. The attack pads harmful requests with long sequences of harmless puzzle reasoning. Across HarmBench, CoT Hijacking reaches attack success rates (ASR) of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, GPT o4 mini, Grok 3 mini, and Claude 4 Sonnet, respectively, far exceeding prior jailbreak methods for LRMs. To understand the effectiveness of our attack, we turn to a mechanistic analysis, which shows that middle layers encode the strength of safety checking, while late layers encode the verification outcome. Long benign CoT dilutes both signals by shifting attention away from harmful tokens. Targeted ablations of attention heads identified by this analysis causally decrease refusal, confirming their role in a safety subnetwork. These results show that the most interpretable form of reasoning, explicit CoT, can itself become a jailbreak vector when combined with final-answer cues. We release prompts, outputs, and judge decisions to facilitate replication.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- North America > United States > California > Santa Clara County > Stanford (0.04)
- Information Technology > Security & Privacy (1.00)
- Law Enforcement & Public Safety > Terrorism (0.94)
- Law (0.89)
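Several entries in this digest report an attack success rate (ASR) on HarmBench-style test sets. As a point of reference, the sketch below shows the usual way such a number is computed: a judge labels each model response to a harmful prompt as a successful attack or not, and ASR is the fraction of successes. The judge_is_harmful callable is a placeholder for whichever judge model or classifier a given study uses; it is not any specific paper's implementation.

    from typing import Callable, Sequence

    def attack_success_rate(prompts: Sequence[str],
                            responses: Sequence[str],
                            judge_is_harmful: Callable[[str, str], bool]) -> float:
        """Fraction of prompts whose response the judge deems a successful attack."""
        assert len(prompts) == len(responses)
        hits = sum(judge_is_harmful(p, r) for p, r in zip(prompts, responses))
        return hits / max(len(prompts), 1)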
ArtPerception: ASCII Art-based Jailbreak on LLMs with Recognition Pre-test
Yang, Guan-Yan, Cheng, Tzu-Yu, Teng, Ya-Wen, Wang, Farn, Yeh, Kuo-Hui
The integration of Large Language Models (LLMs) into computer applications has introduced transformative capabilities but also significant security challenges. Existing safety alignments, which primarily focus on semantic interpretation, leave LLMs vulnerable to attacks that use non-standard data representations. This paper introduces ArtPerception, a novel black-box jailbreak framework that strategically leverages ASCII art to bypass the security measures of state-of-the-art (SOTA) LLMs. Unlike prior methods that rely on iterative, brute-force attacks, ArtPerception introduces a systematic, two-phase methodology. Phase 1 conducts a one-time, model-specific pre-test to empirically determine the optimal parameters for ASCII art recognition. Phase 2 leverages these insights to launch a highly efficient, one-shot malicious jailbreak attack. We propose a Modified Levenshtein Distance (MLD) metric for a more nuanced evaluation of an LLM's recognition capability. Through comprehensive experiments on four SOTA open-source LLMs, we demonstrate superior jailbreak performance. We further validate our framework's real-world relevance by showing its successful transferability to leading commercial models, including GPT-4o, Claude 3.7 Sonnet, and DeepSeek-V3, and by conducting a rigorous effectiveness analysis against potential defenses such as LLaMA Guard and Azure's content filters. Our findings underscore that true LLM security requires defending against a multi-modal space of interpretations, even within text-only inputs, and highlight the effectiveness of strategic, reconnaissance-based attacks. Content Warning: This paper includes potentially harmful and offensive model outputs.
- Asia > Taiwan > Taiwan Province > Taipei (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- North America > United States > Texas > Travis County > Austin (0.04)
- (2 more...)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
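ArtPerception's Phase 1 scores how well a model can read back words rendered as ASCII art, using a Modified Levenshtein Distance (MLD). The abstract does not spell out the modification, so the sketch below uses only a plain, length-normalized Levenshtein similarity as a stand-in for that recognition score; the function names and the normalization are illustrative assumptions, not the paper's exact metric.

    def levenshtein(a: str, b: str) -> int:
        """Classic dynamic-programming edit distance between two strings."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            cur = [i]
            for j, cb in enumerate(b, start=1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution or match
            prev = cur
        return prev[-1]

    def recognition_score(target: str, model_readout: str) -> float:
        """Length-normalized similarity in [0, 1]; 1.0 means the model read the
        ASCII-art word back perfectly. The paper's MLD refines this basic idea."""
        d = levenshtein(target.lower(), model_readout.lower())
        return 1.0 - d / max(len(target), len(model_readout), 1)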
HauntAttack: When Attack Follows Reasoning as a Shadow
Ma, Jingyuan, Li, Rui, Li, Zheng, Liu, Junfeng, Xia, Heming, Sha, Lei, Sui, Zhifang
Emerging Large Reasoning Models (LRMs) consistently excel in mathematical and reasoning tasks, showcasing remarkable capabilities. However, the enhancement of reasoning abilities and the exposure of internal reasoning processes introduce new safety vulnerabilities. A critical question arises: when reasoning becomes intertwined with harmfulness, will LRMs become more vulnerable to jailbreaks in reasoning mode? To investigate this, we introduce HauntAttack, a novel and general-purpose black-box adversarial attack framework that systematically embeds harmful instructions into reasoning questions. Specifically, we modify key reasoning conditions in existing questions with harmful instructions, thereby constructing a reasoning pathway that guides the model step by step toward unsafe outputs. We evaluate HauntAttack on 11 LRMs and observe an average attack success rate of 70%, achieving up to 12 percentage points of absolute improvement over the strongest prior baseline. Our further analysis reveals that even advanced safety-aligned models remain highly susceptible to reasoning-based attacks, offering insights into the urgent challenge of balancing reasoning capability and safety in future model development.
- Europe > Austria > Vienna (0.14)
- Europe > United Kingdom > England (0.04)
- Europe > France (0.04)
- (4 more...)
- Research Report > New Finding (1.00)
- Overview (0.93)
- Law Enforcement & Public Safety (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
- Government > Military (0.88)
VisualDAN: Exposing Vulnerabilities in VLMs with Visual-Driven DAN Commands
Vision-Language Models (VLMs) have garnered significant attention for their remarkable ability to interpret and generate multimodal content. However, securing these models against jailbreak attacks continues to be a substantial challenge. Unlike text-only models, VLMs integrate additional modalities, introducing novel vulnerabilities such as image hijacking, which can manipulate the model into producing inappropriate or harmful responses. Drawing inspiration from text-based jailbreaks like the "Do Anything Now" (DAN) command, this work introduces VisualDAN, a single adversarial image embedded with DAN-style commands. Specifically, we prepend harmful corpora with affirmative prefixes (e.g., "Sure, I can provide the guidance you need") to trick the model into responding positively to malicious queries. The adversarial image is then trained on these DAN-inspired harmful texts and transformed into the text domain to elicit malicious outputs. Extensive experiments on models such as MiniGPT-4, MiniGPT-v2, InstructBLIP, and LLaVA reveal that VisualDAN effectively bypasses the safeguards of aligned VLMs, forcing them to execute a broad range of harmful instructions that severely violate ethical standards. Our results further demonstrate that even a small amount of toxic content can significantly amplify harmful outputs once the model's defenses are compromised. These findings highlight the urgent need for robust defenses against image-based attacks and offer critical insights for future research into the alignment and security of VLMs.
Defending MoE LLMs against Harmful Fine-Tuning via Safety Routing Alignment
Kim, Jaehan, Song, Minkyoo, Shin, Seungwon, Son, Sooel
Recent large language models (LLMs) have increasingly adopted the Mixture-of-Experts (MoE) architecture for efficiency. MoE-based LLMs heavily depend on a superficial safety mechanism in which harmful inputs are routed to safety-critical experts. However, our analysis reveals that routing decisions for harmful inputs drift significantly after fine-tuning, exposing a critical vulnerability to harmful fine-tuning (HFT) attacks. Existing defenses, primarily designed for monolithic LLMs, are less effective for MoE LLMs because they fail to prevent drift in harmful-input routing. To address this limitation, we propose SafeMoE, a safe fine-tuning method tailored to MoE LLMs. SafeMoE directly mitigates routing drift by penalizing the gap between the routing weights of the fine-tuned model and those of the initial safety-aligned model, thereby preserving the safety-aligned routing of harmful inputs to safety-critical experts. Experiments on open-source MoE LLMs ranging from 7B to 141B parameters demonstrate that SafeMoE effectively mitigates HFT attacks, for example reducing the harmfulness score of OLMoE from 62.0 to 5.0, while keeping task-utility degradation within 1% and incurring only 2% overhead. It significantly outperforms state-of-the-art defense methods for safeguarding LLM fine-tuning and remains effective on recent large-scale MoE LLMs such as gpt-oss and Llama 4. Our implementation is available at https://anonymous.4open.science/r/SafeMoE.
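The abstract describes SafeMoE's core idea as penalizing the gap between the fine-tuned router's weights and those of the initial safety-aligned model. Below is a minimal sketch of one plausible instantiation of such a penalty, using a KL divergence between routing distributions computed on the same batch; the function name, tensor shapes, and the choice of KL (rather than, say, an L2 gap on routing weights) are assumptions for illustration, not the paper's exact loss.

    import torch.nn.functional as F

    def routing_alignment_penalty(router_logits_ft, router_logits_ref):
        """Penalty on routing drift between a fine-tuned MoE model and its frozen,
        safety-aligned reference. Both inputs are router logits of shape
        (batch, seq_len, num_experts) computed on the same fine-tuning batch."""
        log_p_ft = F.log_softmax(router_logits_ft, dim=-1)
        p_ref = F.softmax(router_logits_ref, dim=-1)
        # KL(reference || fine-tuned): stays small when the fine-tuned router still
        # sends tokens to the same experts as the safety-aligned reference model.
        return F.kl_div(log_p_ft, p_ref, reduction="batchmean")

    # Hypothetical total objective during fine-tuning, with lambda_route an
    # assumed weighting coefficient:
    # loss = task_loss + lambda_route * routing_alignment_penalty(ft_logits, ref_logits)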
Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study
Lee, DongGeon, Jang, Joonwon, Jeong, Jihae, Yu, Hwanjo
Rapid deployment of vision-language models (VLMs) magnifies safety risks, yet most evaluations rely on artificial images. This study asks: How safe are current VLMs when confronted with meme images that ordinary users share? To investigate this question, we introduce MemeSafetyBench, a 50,430-instance benchmark pairing real meme images with both harmful and benign instructions. Using a comprehensive safety taxonomy and LLM-based instruction generation, we assess multiple VLMs across single and multi-turn interactions. We investigate how real-world memes influence harmful outputs, the mitigating effects of conversational context, and the relationship between model scale and safety metrics. Our findings demonstrate that VLMs are more vulnerable to meme-based harmful prompts than to synthetic or typographic images. Memes significantly increase harmful responses and decrease refusals compared to text-only inputs. Though multi-turn interactions provide partial mitigation, elevated vulnerability persists. These results highlight the need for ecologically valid evaluations and stronger safety mechanisms. MemeSafetyBench is publicly available at https://github.com/oneonlee/Meme-Safety-Bench.
- Europe > Austria > Vienna (0.14)
- North America > United States > Florida > Miami-Dade County > Miami (0.14)
- Asia > Russia (0.14)
- (17 more...)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.46)
Mitigating Jailbreaks with Intent-Aware LLMs
Yeo, Wei Jie, Satapathy, Ranjan, Cambria, Erik
Despite extensive safety-tuning, large language models (LLMs) remain vulnerable to jailbreak attacks via adversarially crafted instructions, reflecting a persistent trade-off between safety and task performance. We comprehensively evaluate both parametric and non-parametric attacks across open-source and proprietary models, considering harmfulness from attacks, utility, over-refusal, and impact against white-box threats. Importantly, our method preserves the model's general capabilities and reduces excessive refusals on benign instructions containing superficially harmful keywords. With the rapid advancement of large language models (LLMs) (Grattafiori et al., 2024; Yang et al., 2025; Liu et al., 2024; Mao et al., 2024), the risk of these models executing harmful or catastrophic instructions has grown correspondingly (Anthropic, 2025). This risk is largely managed by efforts such as a dedicated safety-alignment stage (Ouyang et al., 2022), which aims to ensure that LLMs are not only helpful but also consistently generate safe and ethical outputs. Nevertheless, recent findings by Qi et al. (2024) expose a fundamental vulnerability in prevailing safety-alignment practices: Shallow Alignment. In particular, alignment in most models is largely superficial, constrained to surface-level refusals, resulting in safe outputs that are often limited to generic templates such as "I am sorry but..." or "As a language model...". This superficial alignment permits attackers to circumvent safety mechanisms by explicitly instructing the model to avoid generating commonly recognized refusal responses (Tang, 2024; Andriushchenko et al., 2025). Furthermore, LLMs remain susceptible to a broader range of prompt-based attacks, including those that optimize over discrete suffix tokens (Zou et al., 2023; Basani & Zhang, 2025) or rephrase harmful instructions to look harmless (Chao et al., 2025; Zeng et al., 2024). Beyond initial safety alignment, practitioners have developed a range of inference-time defenses, such as prompting models to adhere to their safety guidelines (Xie et al., 2023) or incorporating additional safety exemplars to enable in-context defense (Wei et al., 2023). Wang et al. (2024) introduce a backdoor trigger into safety-aligned LLMs, serving as a covert prefix that elicits safety responses when detected, without affecting model behavior on benign queries. In recent work, Zhang et al. (2024) introduced a dual-stage prompting strategy, Intention Analysis (IA), which encourages LLMs to analyze the intent behind an instruction prior to generating a safe response.
- Government > Military (0.69)
- Information Technology > Security & Privacy (0.47)
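The Intention Analysis (IA) defense cited in the entry above is a dual-stage prompting strategy: the model first articulates the intent behind an instruction, then answers conditioned on that analysis. Below is a minimal sketch of such a two-stage wrapper; the chat callable, system-prompt wording, and message format are illustrative assumptions rather than the published prompts.

    from typing import Callable, Dict, List

    Message = Dict[str, str]

    def intent_aware_respond(chat: Callable[[List[Message]], str], user_message: str) -> str:
        # Stage 1: have the model state the underlying intent of the request.
        intent = chat([
            {"role": "system", "content": "Summarize the user's underlying intent in one sentence."},
            {"role": "user", "content": user_message},
        ])
        # Stage 2: answer the original request conditioned on the analyzed intent,
        # with an explicit instruction to refuse when the intent is harmful.
        return chat([
            {"role": "system", "content": "If the analyzed intent is harmful, refuse politely; "
                                          "otherwise answer the request helpfully."},
            {"role": "user", "content": f"Analyzed intent: {intent}\n\nRequest: {user_message}"},
        ])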
PUZZLED: Jailbreaking LLMs through Word-Based Puzzles
As large language models (LLMs) are increasingly deployed across diverse domains, ensuring their safety has become a critical concern. In response, studies on jailbreak attacks have been actively growing. Existing approaches typically rely on iterative prompt engineering or semantic transformations of harmful instructions to evade detection. In this work, we introduce PUZZLED, a novel jailbreak method that leverages the LLM's reasoning capabilities. It masks keywords in a harmful instruction and presents them as word puzzles for the LLM to solve. We design three puzzle types (word search, anagram, and crossword) that are familiar to humans but cognitively demanding for LLMs. The model must solve the puzzle to uncover the masked words and then proceed to generate responses to the reconstructed harmful instruction. We evaluate PUZZLED on five state-of-the-art LLMs and observe a high average attack success rate (ASR) of 88.8%, specifically 96.5% on GPT-4.1 and 92.3% on Claude 3.7 Sonnet. PUZZLED is a simple yet powerful attack that transforms familiar puzzles into an effective jailbreak strategy by harnessing LLMs' reasoning capabilities.